Abstract

In this paper, we propose a method to adapt a general parser
(Link Parser) to sublanguages, focusing on the parsing of texts
in biology. Our main proposal is the use of terminology
(identification and analysis of terms) in order to reduce the
complexity of the text to be parsed. Several other strategies
are explored and finally combined, among them text
normalization, extensions to the lexicon and to the
morpho-guessing module, and adaptation of the grammar rules. We
compare the parsing results before and after these adaptations.



1 Introduction

Most available NLP tools are developed for general language,
while processing technical texts, i.e. sublanguages, has become
a necessity for various applications such as extracting
information from biological texts (see (Grishman 01), (Pyysalo
et al. 04), (Grover et al. 04) and (Yakushiji et al. 05)). In
order to assist biologists in their daily bibliographical work,
the ExtraPloDocs project^ develops natural language processing
and machine learning tools that make it possible to build
focused information extraction systems in genomics
(gene-protein interactions, gene functionalities, gene
homologies, etc.) at a reasonable cost. Beyond keyword- and
statistics-based approaches, extracting such relational
information must be based on syntax to achieve good precision
and coverage (see for instance (Ding et al. 03)). We therefore
need reliable syntactic parsing of texts dealing with genomics.

Instead of developing a new parser for each sublanguage, we try
to define a method for adapting a general parser to a specific
sublanguage. This paper presents a strategy to adapt the Link
Parser (LP) (Sleator & Temperley 91) to parse Medline abstracts
dealing with genomics.

In this paper, we first discuss the question of sublanguages
and the different strategies that can be adopted to parse
technical texts. Section 3 presents the context of the
adaptation of the LP to the biological domain. In Section 4, we
analyse several cases of parsing failure along with the
solutions we propose to adapt the parser. We finally present
the evaluation of the modifications we made to the LP grammar
and lexicon.

^ExtraPloDocs website: http://www-lipn.univ-paris13.fr/RCLN/Extra/ExtraPloDocs/
Results are also exploited for the development of specialized
search engines in the ALVIS project (STREP):
http://cosco.hiit.fi/search/alvis.html

2 Previous works

Sublanguages have been studied for a long time, even though
they remain a rather marginal topic in linguistic and NLP
studies. It is noticeable that in specific domains of
knowledge, among certain communities and in particular types of
texts, people have their own way of writing. These specific
languages are called sublanguages (Harris et al. 89; Grishman &
Kittredge 86), restricted languages or specialized languages,
depending on whether one focuses on the continuity or on the
gap between these languages and the "usual language". In fact,
a sublanguage is both a restricted language (fewer lexicon
items and semantic classes) and a deviant one (original lexicon
items and phrasings). This is also noticeable from a
distributional point of view. As Harris noticed, a sublanguage
can be characterized by its selectional restrictions and more
generally by the distribution of lexicon items and syntactic
patterns.

(Sekine 97) has argued that parsing should be domain dependent.
Three alternative approaches can be considered. Several NLP
teams have decided to develop a specialized parser for a given
sublanguage (see for instance the String project (Sager et al.
87) or (Pustejovsky et al. 02)), but this approach is
considered too expensive for many applications. A second track
consists in training a grammar on a specialized corpus, which
requires annotated corpora that are rare in specialized
domains. An intermediate approach aims at manually adapting a
parser, as proposed in (Pyysalo et al. 04). This is our
approach. This work can be considered as preliminary work to
evaluate the potential for automating this adaptation.

Two different approaches have been explored for parsing
evaluation. The first is linguistically oriented and based on
test suites, i.e. sets of sentences that illustrate the various
syntactic structures that a parser is supposed to analyse, as
in TSNLP (Lehman 96). The second approach, more pragmatic and
more common, consists in evaluating the performance of a parser
on a given corpus supposed to be representative of the textual
data to parse. As shown in the following sections, we adopted a
mixed approach.

As we will see below, one of the main prob- 
lems in parsing sublanguages is the ambiguity of 
prepositional attachment. 

3 Context 

3.1 The corpora 

Three different corpora were built from Medline^ abstracts (in
English) dealing with transcription in Bacillus subtilis. As
recommended by (Prasad & Sarkar 00) and (Srinivas et al. 98),
we mixed the two evaluation standards by randomly selecting 212
sentences that we organized according to their linguistic
specificities. Despite its relatively small size, the resulting
MED-TEST corpus is a good sample of the sublanguage of
genomics. We also used a larger corpus of full abstracts
(TRANSCRIPT, 16,981 sentences, 434,886 words) and the GIEC
corpus, made of 160 sentences expressing gene/protein
interactions. The GIEC corpus was built and used as a benchmark
corpus in the context of the Genic Interaction Extraction
Challenge^, held jointly with ICML 2005.

3.2 The initial parser choice 

In the context of our IE task, and particularly for ontology
acquisition, we need reliable and precise syntactic relations
between the words of the whole sentence (except empty words).
For those reasons, a symbolic dependency-based parser seemed to
be the most appropriate choice.

LP presents several advantages, among them its robustness, the
good quality of its parses, the fit between the dependency
technique and representation and our IE task, and the
declarative format of its lexicon. The evaluation that we
carried out on different parsers with the MED-TEST corpus
showed that dependency-based parsers give better results on
long and complex sentences, particularly with coordinations.
This conclusion is shared by (Ding et al. 03), who also worked
on Medline abstracts. Other experiments, in the context of the
ExtrAns project (Molla et al. 00), showed that 76% of 2,781
sentences from a Unix manpage corpus were completely parsed by
LP, regardless of parsing quality, while we reach only 54% on
the biological corpus. When looking at the quality of the
parses, we noticed different kinds of errors, stemming either
from the biological domain or from more general linguistic
difficulties such as ambiguous constructions. We propose three
solutions to address these issues: text normalization, the use
of terminology and the adaptation of the lexicon/grammar of LP.

^http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
^http://genome.jouy.inra.fr/texte/LLLchallenge

4 Diagnosis and adaptation 

Our analysis of the performance of the Link gram- 
mar on the biological corpus confirms previous 
works. The main problems can be classified along 
the following axes. 

4.1 "Textual noise" 

Scientific texts present particularities that we chose to
handle in a normalization step prior to parsing. First, the
segmentation into sentences and words was taken out of the
parser and enriched with named entity recognition and rules
specific to the biological domain. We also delete some
extratextual information that degrades the parsing quality.
Finally, we use dictionaries and transducers to replace gene
and species names by two codes, which avoids extending the LP
dictionary too much.
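The replacement of gene and species names by two codes can be sketched as follows. This is only an illustration of dictionary-based normalization, not the transducers actually used: the name lists, the codes `GENE` and `SPECIES`, and the function names are assumptions made for the example.

```python
import re

# Hypothetical dictionaries; the resources actually used are not shown here.
GENE_NAMES = {"sigK", "spoIIID", "gerE"}
SPECIES_NAMES = {"Bacillus subtilis", "B. subtilis", "Escherichia coli"}

GENE_CODE = "GENE"        # assumed placeholder code
SPECIES_CODE = "SPECIES"  # assumed placeholder code


def build_pattern(names):
    """Compile one alternation over all names, longest names first."""
    ordered = sorted(names, key=len, reverse=True)
    body = "|".join(re.escape(name) for name in ordered)
    # The lookarounds prevent matches inside longer alphanumeric tokens.
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)")


SPECIES_RE = build_pattern(SPECIES_NAMES)
GENE_RE = build_pattern(GENE_NAMES)


def normalize(sentence):
    """Replace every known species name, then every known gene name, by its code."""
    sentence = SPECIES_RE.sub(SPECIES_CODE, sentence)
    return GENE_RE.sub(GENE_CODE, sentence)


print(normalize("The sigK gene of Bacillus subtilis is transcribed."))
```

With such a step, the parser lexicon only needs entries for the two codes rather than one entry per gene or species name.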

4.2 Unknown words 

In the TRANSCRIPT corpus, we identified 6,005 out-of-lexicon
forms (45,804 occurrences) among 12,584 distinct words, i.e.
47.72%. They are mostly Latin words, numbers, DNA sequences,
gene names, misspellings and technical vocabulary.

However, LP includes a module that can assign a syntactic
category to an unknown word on the basis of its suffix.
Modifying this morpho-guessing (MG) module seemed a better
strategy than extending the dictionary, since biological
objects differ from one organism to another. We therefore
created 19 new MG classes for nouns (-ase, -ity, etc.) and
adjectives (-al, -ous, etc.), along with their rules.
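A minimal sketch of suffix-based category guessing is given below. It is not the LP implementation: only the four suffixes quoted above come from the text, while the category labels, the lexicon lookup and the `unknown` fallback are assumptions made for illustration.

```python
# Suffixes quoted in the text; category labels are illustrative assumptions.
SUFFIX_RULES = [
    ("ase", "noun"),
    ("ity", "noun"),
    ("ous", "adjective"),
    ("al", "adjective"),
]


def guess_category(word, lexicon):
    """Return the lexicon category if the word is known, else guess from its suffix."""
    if word in lexicon:
        return lexicon[word]
    lowered = word.lower()
    # Try longer suffixes first so that the most specific rule wins.
    for suffix, category in sorted(SUFFIX_RULES, key=lambda rule: len(rule[0]), reverse=True):
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return category
    return "unknown"


for form in ["polymerase", "transcriptional", "sigK"]:
    print(form, guess_category(form, lexicon={}))
```

Words that match no suffix rule stay unknown, which is why the lexicon extension and the name normalization remain necessary alongside the guessing rules.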

At the same time, we added about 500 words of the biological
domain to the LP lexicon in different classes, mainly nouns,
adjectives and verbs.

4.3 Specific constructions 

Some words already defined in the LP lexicon have a specific
usage in biological texts, which required some modifications,
including moving words from one class to another and adapting
or creating rules.

The main motivation for moving words from one class to another
is that the abstracts are written by non-native English
speakers. This point was also raised by (Pyysalo et al. 04).
One way to allow the parsing of such ungrammatical sentences is
to relax constraints, for instance by moving some words from
the countable to the mass-countable class.

Some very frequent words present idiosyncratic uses (particular
valency of verbs for instance), which led to the modification
or creation of rules. Numbers and measure units are omnipresent
in the corpus and were not always well described, or even
present, in the lexicon/grammar. Other minor changes were made
that are not detailed in this paper.

4.4 Structural ambiguity 

We identified two cases of ambiguity that can be 
partially resolved by using terminology. 

Prepositional attachment is a tricky point that is often
resolved using statistical information from the text itself
(Hindle & Rooth 93; Fabre & Bourigault 01), a larger corpus
(Bourigault & Frérot 04), the web (Volk 02; Gala Pavia 03) or
an external resource such as WordNet (Stetina & Nagao 97). The
second major ambiguity factor is the attachment of series of
more than two nouns. As shown in Figure 1, neither a parallel
attachment (lp) nor a serial one (lp-bio) is satisfactory. We
noticed that such cases often appear inside larger nominal
phrases that often correspond to domain-specific terms. For
this reason, we decided to identify terms in a pre-processing
step and to reduce them to their syntactic head. If needed, the
internal analysis of terms is added to the parsing result for
the simplified sentence (see lp-bio-t). The strategy proposed
by (Sutcliffe et al. 95), which consists in linking the words
contained in a compound (for instance "sporulation_process"),
was excluded: it increases the size of the lexicon and does not
reduce complexity, for reasons due to the implementation of LP.

[Link diagrams for "two-component signal transduction systems":
a) parallel attachment (lp); b) series attachment (lp-bio), with
AN links; c) correct attachment (lp-bio-t). The legend
distinguishes correct links from erroneous links.]

Figure 1: Series of nouns dependencies

Figure 2 shows the influence of the adaptation on parsing, with
the correction of a segmentation error and the disambiguation
of prepositional and nominal attachments.

Before actually integrating the use of terminology into our
processing suite, we simulated this simplification of terms.
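The reduction of terms to their syntactic head can be illustrated by the sketch below. It assumes that a term list with known heads is already available (here a small hypothetical dictionary); the function name and the data structure are illustrative and do not describe the actual processing suite.

```python
# Hypothetical term list: each multi-word term (token tuple) maps to its head.
TERMS = {
    ("two-component", "signal", "transduction", "systems"): "systems",
    ("mother", "cell", "compartment"): "compartment",
}


def simplify_terms(tokens, terms=TERMS):
    """Replace each known term by its head.

    Returns the reduced token list and, for each replacement, the position of
    the head in the output together with the original term tokens, so that the
    internal analysis of the term can be re-attached after parsing.
    """
    max_len = max((len(term) for term in terms), default=0)
    reduced, replaced = [], []
    i = 0
    while i < len(tokens):
        match = None
        # Prefer the longest term starting at position i.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in terms:
                match = candidate
                break
        if match:
            replaced.append((len(reduced), list(match)))
            reduced.append(terms[match])
            i += len(match)
        else:
            reduced.append(tokens[i])
            i += 1
    return reduced, replaced


tokens = "protein produced in the mother cell compartment".split()
print(simplify_terms(tokens))
```

The parser then only sees the shorter token sequence, while the recorded positions allow the term-internal links to be merged back into the final analysis.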

5 Evaluation 

We performed a two-stage evaluation of the mod- 
ifications in order to measure the respective con- 
tribution of the LP adaptation on the one hand 
and of the term simplification on the other hand. 

5.1 Corpus and criteria 

We used a subset (10 files^) of the MED-TEST corpus but,
contrary to the first evaluation (choice of a parser), we
wanted to look at the quality of the whole parse and not only
at specific relations.

Table 1 (for the MED-TEST subset) shows how out-of-lexicon
words (OoL), i.e. unknown (UW) and guessed (GW) words, are
handled, by giving the percentage of incorrect morpho-syntactic
(MS) category assignations with the original resources (lp),
with those adapted to biology (lp-bio) and finally with the
latter combined with the simplification of terms (lp-bio-t).

In Table 2, five criteria describe the parsing time and quality
for each sentence: the number of linkages (NbL), the parsing
time (PT) in seconds, whether or not a complete linkage is
found (CLF), the number of erroneous links (EL) and the quality
of the constituency parse (CQ). NbW is the average number of
words in a sentence, which varies with the term simplification.
The results are given for each of the three versions of the
parser.

^ 141 sentences, 2630 words

[Link diagrams for the fragment "protein produced in the mother
cell compartment from T(2) of sporulation": a) lp parse; b)
lp-bio-t, i.e. lp-bio parse with terms simplification, the
term-internal links being supplied by the terms analyser. The
legend distinguishes correct links, erroneous links and words
deleted by simplification.]

Figure 2: Example of parsing

        lp              lp-bio          lp-bio-t
        a      b        a      b        a      b
UW      244    41.4%    53     52.8%    26     19.2%
GW      24     4.2%     72     0%       31     0%
OoL     268    38%      125    22.4%    57     8.8%

a: total MS assignations, b: % of incorrect assignations

Table 1: Incorrect MS category assignations

UW, GW, NbL, PT and CLF are objective data, while EL and CQ
require linguistic expertise. The CQ evaluation consisted in
assigning a general quality score to each sentence.

5.2 Results and comments 

The extension of the MG module reduced the rate of erroneous
morpho-syntactic category assignations (see Table 1) from 38%
to 22.4%. 61% of the sentences in which at least one
assignation error was corrected by the MG module actually have
better parsing results (15% have been degraded). More
generally, the increase in guessed forms makes the category
assignation more reliable.
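As a consistency check on these figures, the overall out-of-lexicon (OoL) error rate can be recomputed as the count-weighted average of the UW and GW rows of Table 1. The sketch below only re-derives numbers already reported in the table; the variable and function names are illustrative.

```python
# (total assignations, % incorrect) for unknown (UW) and guessed (GW) words,
# copied from Table 1.
TABLE1 = {
    "lp":       {"UW": (244, 41.4), "GW": (24, 4.2)},
    "lp-bio":   {"UW": (53, 52.8),  "GW": (72, 0.0)},
    "lp-bio-t": {"UW": (26, 19.2),  "GW": (31, 0.0)},
}


def ool_error_rate(rows):
    """Return (total OoL assignations, % incorrect) from the UW and GW rows."""
    total = sum(count for count, _ in rows.values())
    wrong = sum(count * pct / 100 for count, pct in rows.values())
    return total, 100 * wrong / total


for version, rows in TABLE1.items():
    total, rate = ool_error_rate(rows)
    print(f"{version}: {total} assignations, {rate:.1f}% incorrect")
```

The recomputed totals (268, 125 and 57) and rates (about 38%, 22.4% and 8.8%) agree with the OoL row of the table, up to rounding of the reported percentages.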

The extension of the lexicon and the normalization of gene and
species names relieved the two modules of 143 assignations out
of 268, 50 of which were wrong. 64% of the sentences in which
at least one assignation error was corrected by the extension
of the lexicon have better parsing results (18% of the
sentences were degraded).

The effect of the modification and creation of rules is
difficult to evaluate precisely, though it certainly plays a
part in the parsing improvement, especially the relaxing of
constraints on determiners and inserts.

         lp         lp-bio               lp-bio-t
crit.    avg        avg        %/lp      avg      %/lp
NbW      24.05      24.05      100%      18.9     78.6%
NbL      190,306    232,622    122.2%    1,431    0.75%
PT       37.83      29.4       77.7%     0.53     1.4%
CLF      0.54       0.72       133%      0.77     142.6%
EL       2.87       1.91       66.5%     1.15     40.1%
CQ       0.54       0.7        129.6%    0.8      148.1%

Table 2: Parsing time and quality

The most obvious contribution to the better parsing quality is
that of the term simplification. The drastic reduction in
parsing time and number of linkages gives an idea of the
reduction in complexity. It is not only due to the smaller
number of words, since the number of erroneous links is reduced
by 60% while the number of words is reduced by only 21.4%. This
confirms previous similar studies that showed a 40% reduction
of the error rate on the main syntactic relations with a French
corpus.
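The reductions quoted in this paragraph follow from the averages reported in Table 2. The short sketch below recomputes them relative to the lp baseline; it uses only values from the table, and the function names are illustrative.

```python
# Average values per parser version, copied from Table 2.
TABLE2 = {
    "NbW": {"lp": 24.05, "lp-bio": 24.05, "lp-bio-t": 18.9},
    "PT":  {"lp": 37.83, "lp-bio": 29.4,  "lp-bio-t": 0.53},
    "EL":  {"lp": 2.87,  "lp-bio": 1.91,  "lp-bio-t": 1.15},
}


def percent_of_baseline(criterion, version, baseline="lp"):
    """Value of a criterion for one version, as a percentage of the baseline."""
    row = TABLE2[criterion]
    return 100 * row[version] / row[baseline]


def reduction(criterion, version, baseline="lp"):
    """Relative reduction (in %) of a criterion with respect to the baseline."""
    return 100 - percent_of_baseline(criterion, version, baseline)


print(f"EL reduction:  {reduction('EL', 'lp-bio-t'):.1f}%")
print(f"NbW reduction: {reduction('NbW', 'lp-bio-t'):.1f}%")
print(f"PT relative to lp: {percent_of_baseline('PT', 'lp-bio-t'):.1f}%")
```

Erroneous links drop by about 59.9% and sentence length by about 21.4%, which matches the figures in the text. Since the table reports rounded averages, recomputed percentages may differ from the printed %/lp column in the last digit.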

Remaining errors are mainly due to four different phenomena.
First, the normalization step, prior to parsing, needs to be
enhanced. Second, concerning LP, there are still lexicon gaps,
wrong class assignations and a still unsatisfactory handling of
numerical expressions. In addition, like (Sutcliffe et al. 95),
we identified a weakness of LP regarding coordination. A
specific study of the coordination system in LP and in
biological texts may be necessary. Finally, some ambiguous
nominal and prepositional attachments remain in spite of the
term simplification. These may be resolved in a post-processing
step, as in ExtrAns, which uses a corpus-based approach to
retrieve the correct attachment from the different linkages
given by LP for a sentence.

Other questions, such as feeding LP with a morpho-syntactically
tagged text or improving the parse ranking in LP, were not
discussed in this paper but are interesting issues that we
intend to study.

6 Conclusion 

Since parsing is domain and language dependent, a general
parser must be adapted to each given sublanguage. In the
context of an IE project in biology, we have adapted the Link
Parser to analyse the specific language of Medline abstracts in
genomics. Our initial diagnosis mainly raised two different
problems that are traditional in sublanguage analysis: the lack
of lexical coverage and structural ambiguity, especially in the
case of prepositional phrase attachments.

We showed that the lexical problem can be handled manually by
introducing new words into the lexicon and by extending the
morpho-guessing module. We also proposed to distinguish and
combine terminological and syntactic analysis. In the same way
as morpho-syntactic tagging should be considered independently
from parsing, we argue that terminology analysis must be
handled separately. This represents the main automated part of
the adaptation task. The use of terminology to alleviate the
parsing task is relevant and applicable in the context of
domain-specific text processing, since terminology tools and
lists of terms are generally available. It also reduces the
amount of actual modification of the lexicon/grammar of the
parser. This first evaluation has shown promising results.

This work has been developed as part of the 
ExtraPloDocs (extraction of gene-protein interac- 
tions in Medline abstracts) and ALVIS projects. 
We have shown that combining the terminological 
and syntactic analysis has an important impact 
on the resulting parses because the terminologi- 
cal analysis simplifies the parser input. 